Authors: Hamilton Hewlett & Laura Sikes, MPH, CPH
Group: Not Sure
Instructor: Dr. Samantha Seals
Date: November 19, 2024
\[ \hat{y} = \beta_0 + \beta_1 x \]
In the regression line equation \[ \hat{y} = \beta_0 + \beta_1 x \] the intercept \(\beta_0\) is \[ \beta_0 = \frac{\sum y}{n} - \beta_1 \frac{\sum x}{n} \] and the slope \(\beta_1\) is \[ \beta_1 = \frac{n \sum x y - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \]
Based on the above, the regression line can be used to predict \(\hat{y}\) at values of \(x\) where there are no observed data points.
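As a minimal sketch of the slope and intercept formulas above, the Python snippet below computes both from the raw sums and then predicts at an unobserved \(x\). The function name `fit_line` and the toy data are assumptions made for illustration; they are not part of the BRFSS analysis.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) using the closed-form sums shown above."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # slope: (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # intercept: mean(y) - slope * mean(x)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Toy data (illustrative only) lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_line(xs, ys)

# Predict at x = 6, a value with no observed data point
prediction = b0 + b1 * 6
print(b0, b1, prediction)
```

Because the toy data lie exactly on a line, the fitted coefficients recover that line and the prediction simply extends it.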
Assumptions (Zambesi, 2024):
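The expected value of the error term is zero: \(E(\varepsilon) = 0\)
All error terms have the same variance (homoscedasticity)
The error terms are normally distributed
The error terms are independent of one another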
\[ Q_Y(\tau \mid X) = X \beta(\tau) \]
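A short sketch may help show where the quantile \(\tau\) enters. In the standard formulation of quantile regression (stated here as background, not taken from the slides), \(\beta(\tau)\) is chosen to minimize an asymmetric "check" loss on the residuals. The Python snippet below uses an intercept-only model and toy data, both assumptions for illustration, to show that minimizing this loss recovers the sample quantile.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: tau * r if r >= 0, (tau - 1) * r if r < 0."""
    return residual * (tau - (1 if residual < 0 else 0))


def total_loss(ys, q, tau):
    """Total check loss when every observation is predicted by the constant q."""
    return sum(check_loss(y - q, tau) for y in ys)


# Toy data (illustrative only) with one extreme value
ys = [1, 2, 3, 4, 100]

# Search the observed values for the constant that minimizes the loss
median_fit = min(ys, key=lambda q: total_loss(ys, q, 0.5))
upper_fit = min(ys, key=lambda q: total_loss(ys, q, 0.75))
print(median_fit, upper_fit)
```

The extreme value 100 pulls the mean to 22 but leaves the \(\tau = 0.5\) fit at the median, which is the resilience to skewness that motivates the method.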
Watch outs:
The more quantiles you add, the smaller the sample behind each one becomes. When you break your data into these sections, you split the sample population and look only at what falls within a given quantile. This can raise validity concerns if the sample size is questioned (Lê Cook & Manning, 2013).
You may also run into quantile crossing, which should not happen but seems inevitable. It is a concern that could make you question whether the method is valid. Picture four regression lines fitted to different quantiles of the same data set: unless the lines are perfectly parallel, they are bound to intersect at some point. A non-crossing model is attainable (Jiang & Yu, 2023).
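To make the crossing concern concrete, the sketch below finds the \(x\) value at which two fitted quantile lines intersect. The function name `crossing_x` and the coefficients are hypothetical and chosen only for illustration; they are not estimates from this analysis.

```python
def crossing_x(b0_low, b1_low, b0_high, b1_high):
    """Return the x where two lines meet, or None if they are parallel."""
    if b1_low == b1_high:
        return None  # parallel lines never cross
    # Solve b0_low + b1_low * x = b0_high + b1_high * x for x
    return (b0_high - b0_low) / (b1_low - b1_high)


# Hypothetical fits: 0.25 quantile y = 10 + 1.0x, 0.75 quantile y = 20 + 0.5x
x_cross = crossing_x(10.0, 1.0, 20.0, 0.5)
print(x_cross)

# Past the crossing point the lower quantile line sits above the upper one
x_beyond = x_cross + 5
low_at_beyond = 10.0 + 1.0 * x_beyond
high_at_beyond = 20.0 + 0.5 * x_beyond
print(low_at_beyond > high_at_beyond)
```

Crossing matters in practice only if the intersection falls inside the range of the observed predictor values.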
The formula for \(R^2\) is as follows:
\[
R^2 = 1 - \frac{\sum \left(y_i - \hat{y}_i\right)^2}{\sum \left(y_i - \bar{y}\right)^2}
\]
- Each fitted value \(\hat{y}_i\) is subtracted from its observed value \(y_i\); the differences are squared and then summed, forming the numerator.
- The sample mean \(\bar{y}\) is subtracted from each observed value \(y_i\); the differences are squared and then summed, forming the denominator.
- This formula may also be represented as \[
R^2 = 1 - \frac{RSS}{TSS}
\] where RSS is the residual sum of squares and TSS is the total sum of squares.
- The post-hoc testing methods Koenker and Machado proposed are \(\Delta_n(\tau)\), \(W_n(\tau)\), and \(T_n(\tau)\), which the authors claimed would vastly enhance the capacity for quantile regression inference.
- The theory behind these methods is beyond the scope of this paper.
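Returning to the \(R^2\) formula, the ratio of RSS to TSS can be sketched in a few lines of Python. The function name `r_squared` and the toy observed and fitted values are assumptions for illustration only.

```python
def r_squared(observed, fitted):
    """Compute R^2 = 1 - RSS / TSS from paired observed and fitted values."""
    mean_y = sum(observed) / len(observed)
    # RSS: squared differences between each observation and its fitted value
    rss = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    # TSS: squared differences between each observation and the sample mean
    tss = sum((y - mean_y) ** 2 for y in observed)
    return 1 - rss / tss


# Toy values (illustrative only)
observed = [2.0, 4.0, 6.0, 8.0]
fitted = [2.5, 3.5, 6.5, 7.5]
r2 = r_squared(observed, fitted)
print(round(r2, 3))
```

Note that the squaring happens per observation before summing, which is what separates RSS and TSS from a difference of two sums.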
Waldmann (2018) utilized body mass index (BMI) data to illustrate the usefulness of QR.
The author’s goal was to answer questions about the outer portions of the distribution, which in this case were the boys with the highest BMI.
QR afforded the author insight into how predictor variables can have stronger or weaker effects within different areas of the distribution.
Cohen et al. (2016) used QR to evaluate the relationships between community demographics and perceived resilience.
The authors divided the distribution into percentiles and quartiles, and found that significant relationships existed inside several quantiles that were not significant when using linear regression on the entire distribution.
Staffa et al. (2019) employed QR to understand relationships between ventilator dependence and hospital length of stay, suspecting that these relationships changed across different quantiles.
The authors found stronger relationships between the predictor variables and the outcome in the upper quantiles of length of stay.
Machado and Silva (2005) explored the use of QR for count data by incorporating artificial smoothness, allowing inferences to be made about the conditional quantiles. The authors note that count data are often severely skewed, making them an excellent candidate for QR, which is more resilient against skewness given its use of medians over means.
Lê Cook and Manning (2013) used quantile regression to show that raising taxes on alcohol did not reduce consumption among heavy alcohol users as intended. The study divided users into three groups: light, moderate, and heavy drinkers. The quantile regression results showed that light and heavy drinkers were not sensitive to the increase in price; the tax deterred only moderate drinkers.
Lê Cook and Manning (2013) also discuss why quantile regression is necessary when examining health care expenditures. They show that users can be analyzed more effectively when divided into subgroups by health care usage. This sheds more light on user characteristics such as race, employment, gender, and insurance status, among others. Regression based on the mean alone does not tell the full story.
DIABETE4 is a variable used for participants to share whether they have ever been told they have diabetes. The acceptable responses were 1 (Yes), 2 (Yes, but female told only during pregnancy), 3 (No), 4 (No, pre-diabetes or borderline diabetes), 7 (Don’t know/Not sure), and 9 (Refused). For this analysis, we stratified by DIABETE4 and focused on respondents who answered 1 (Yes).
WEIGHT2 is a variable for participants to share how much they weigh without shoes on. The acceptable responses were 50-0776 (Weight in pounds), 7777 (Don’t know/Not sure), 9023-9352 (Weight in kilograms), 9999 (Refused), and BLANK (Not asked or Missing).
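The WEIGHT2 coding can be turned into a cleaning rule. The sketch below is a hedged illustration in Python, not the code used for this analysis. It assumes that the kilogram codes carry a leading 9 (so that 9070 means 70 kg), which should be confirmed against the BRFSS codebook. The function name `weight_in_pounds` is hypothetical.

```python
KG_TO_LB = 2.20462  # pounds per kilogram


def weight_in_pounds(code):
    """Map a raw WEIGHT2 code to pounds, or None when no weight is usable."""
    if code is None:
        return None  # BLANK: not asked or missing
    if 50 <= code <= 776:
        return float(code)  # already recorded in pounds
    if 9023 <= code <= 9352:
        # Assumption: the leading 9 flags a metric response, so drop 9000
        return (code - 9000) * KG_TO_LB
    return None  # 7777 (Don't know/Not sure), 9999 (Refused), or out of range


# Example raw codes: pounds, kilograms, don't know, refused, blank
raw_codes = [180, 9070, 7777, 9999, None]
cleaned = [weight_in_pounds(c) for c in raw_codes]
print(cleaned)
```

Mapping the "Don’t know" and "Refused" codes to missing values before modeling keeps them from being read as very large weights.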
HEIGHT3 is a variable that allowed participants to share their height without shoes on. I used only the responses recorded in imperial units. The metric responses were recorded as much larger values, which made them easy to identify as metric.
DIABAGE4 is a variable that allows participants to share how old they were when first told they had diabetes. The acceptable responses are 1-97 (Age in years), 98 (Don’t know/Not sure), 99 (Refused), and BLANK (Not asked or Missing).
To keep the data consistent, I ensured that this variable is present only in the DIABETE4 = 1 (Yes) stratum in Table 1.
| Stratified by DIABETE4 | 1 | 2 | 3 | 4 | 7 | 9 |
|---|---|---|---|---|---|---|
| n | 1952 | 78 | 9764 | 222 | 28 | 2 |
| DIABAGE4 (mean(SD)) | 54.39 (18.62) | - | - | - | - | - |
| WEIGHT2 (mean(SD)) | 199.36 (50.47) | 164.17 (36.29) | 177.37 (45.32) | 196.36 (51.12) | 179.36 (55.79) | 153.00 (62.23) |
| HEIGHT3 (mean(SD)) | 610.68 (837.91) | 1115.53 (2145.01) | 709.05 (1238.72) | 738.83 (1335.65) | 508.54 (31.71) | 506.00 (1.41) |
Figure 1.
Figure 2.
Started out considering only Age ~ Weight
Expanded into Age ~ Weight + Height
Created models using rq() from the quantreg package.
Found coefficients to create regression lines
Plotted results using ggplot2
Created confidence intervals from scratch because calling confint() on the rq() models produced a vcov error.
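One common way to build such intervals by hand is the normal approximation, coefficient ± 1.96 × standard error. The Python sketch below illustrates that approach; the slides do not state which method was actually used, so treat it as an assumption. The standard error of 0.0069 is a hypothetical value chosen for illustration, not a reported estimate, and the function names are likewise hypothetical.

```python
import math


def wald_interval(coef, se, z=1.96):
    """Normal-approximation 95% interval: coef -/+ z * se."""
    return coef - z * se, coef + z * se


def two_sided_p(coef, se):
    """Two-sided p-value for H0: coefficient = 0 under a normal approximation."""
    z = abs(coef / se)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


coef = -0.04386  # weight coefficient for the 0.50 quantile (MM50)
se = 0.0069      # hypothetical standard error, for illustration only

lower, upper = wald_interval(coef, se)
p_value = two_sided_p(coef, se)
print(round(lower, 3), round(upper, 3), p_value)
```

An interval that excludes zero corresponds to a p-value below 0.05 under the same approximation.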
Table 1.
| | MM25 | MM50 | MM75 |
|---|---|---|---|
| Height | | | |
| Coefficient | 0.00501 | -0.00296 | 0.00466 |
| Upper 95% Confidence Interval | 0.028 | 0.018 | 0.03 |
| Lower 95% Confidence Interval | -0.018 | -0.024 | -0.021 |
| P-Value | 0.66932 | 0.78491 | 0.71881 |
| Weight | | | |
| Coefficient | -0.02405 | -0.04386 | -0.0634 |
| Upper 95% Confidence Interval | -0.016 | -0.03 | -0.048 |
| Lower 95% Confidence Interval | -0.032 | -0.057 | -0.079 |
| P-Value | 0 | 0 | 0 |
Age at which respondents with diabetes were informed of their diagnosis, as a function of weight and height.
Table 2a.
| Category | n = 1319 | % |
|---|---|---|
| Self-Assessed General Health | ||
| Excellent | 42 | 3.3 |
| Very Good | 205 | 16.3 |
| Good | 383 | 30.4 |
| Fair | 424 | 33.6 |
| Poor | 265 | 21.0 |
| Poor Physical Health Days / Mo. | ||
| 1-7 | 537 | 42.6 |
| 8-14 | 168 | 13.3 |
| 15-21 | 212 | 16.8 |
| >21 | 402 | 31.9 |
Table 2b.
| Category | n = 1319 | % |
|---|---|---|
| Poor Mental Health Days / Mo. | ||
| 1-7 | 481 | 38.1 |
| 8-14 | 210 | 16.7 |
| 15-21 | 247 | 19.6 |
| >21 | 381 | 30.2 |
| Poor Health Days Affected ADL / Mo. | ||
| 1-7 | 549 | 43.5 |
| 8-14 | 191 | 15.1 |
| 15-21 | 247 | 19.6 |
| >21 | 332 | 26.3 |
Table 2c.
| Category | n = 1319 | % |
|---|---|---|
| Education | ||
| Less than High School | 104 | 8.2 |
| High School or GED | 345 | 27.4 |
| Some College | 450 | 35.7 |
| 4 Years College or More | 420 | 33.3 |
Table 2d.
| Category | n = 1319 | % |
|---|---|---|
| Annual Income | ||
| ≤ $24,999 | 413 | 32.8 |
| $25,000 - $34,999 | 220 | 17.4 |
| $35,000 - $49,999 | 183 | 14.5 |
| $50,000 - $74,999 | 192 | 15.2 |
| $75,000 - $99,999 | 146 | 11.6 |
| ≥$100,000 | 165 | 13.1 |
| Sex | ||
| Male | 493 | 39.1 |
| Female | 826 | 65.5 |
Table 3.
| Quartile | Poor ADL Days Coefficient | 95% Lower CI | 95% Upper CI | p-Value |
|---|---|---|---|---|
| 0.25 | 0.005 | -0.002 | 0.012 | 0.158 |
| 0.50 | 0.000 | -0.013 | 0.013 | 1.000 |
| 0.75 | 0.032 | -0.004 | 0.068 | 0.078 |
Note. \(\alpha\) = 0.05
Figure 3.
Days per Month ADLs Affected by Poor Mental or Physical Health as a Function of Weight (lbs)
Weight turned out to be a significant predictor: its p-value was less than \(\alpha\) in all quantiles.
Height turned out not to be a significant predictor: its p-value was greater than \(\alpha\) in all quantiles.
Based on my models and the BRFSS data, weight appears to be a significant predictor of the age at which you are informed you have diabetes. Height, however, does not appear to be a significant predictor of that age.
In the MM25 model (0.25 quantile), we could also see that the “Refused” and “Don’t know” responses were skewing the regression line.